Abstract. The LASSO estimator is an $\ell_1$-norm penalized least-squares estimator, which was
introduced for variable selection in the linear model. When the design matrix satisfies, e.g., the
Restricted Isometry Property, or has a small coherence index, the LASSO estimator has been proved
to recover, with high probability, the support and sign pattern of sufficiently sparse regression
vectors. Under similar assumptions, the LASSO satisfies adaptive prediction bounds in various norms.
The present note provides a prediction bound based on a new index for measuring how favorable a
design matrix is for the LASSO estimator. We study the behavior of our new index for matrices with
independent random columns uniformly drawn on the unit sphere. Using the simple trick of appending
such a random matrix (with the right number of columns) to a given design matrix, we show that a
prediction bound similar to [6, Theorem 2.1] holds without any constraint on the design matrix,
other than restricted non-singularity.

1. Introduction 



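Given a linear model

$$ y = X\beta + \varepsilon, \tag{1.1} $$

where $X \in \mathbb{R}^{n \times p}$ and $\varepsilon$ is a random vector with Gaussian distribution
$\mathcal{N}(0, \sigma^2 I)$, the LASSO estimator is given by

$$ \hat{\beta} = \operatorname{argmin}_{b \in \mathbb{R}^p} \ \frac{1}{2}\|y - Xb\|_2^2 + \lambda \|b\|_1. \tag{1.2} $$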

This estimator was first proposed in the paper of Tibshirani [17]. The LASSO estimator $\hat{\beta}$
is often used in the high dimensional setting where $p$ is much larger than $n$. As can be expected,
when $p \gg n$, estimation of $\beta$ is hopeless in general unless some additional property of
$\beta$ is assumed. In many practical situations, it is considered relevant to assume that $\beta$ is
sparse, i.e. has only a few nonzero components, or at least compressible, i.e. the magnitude of the
nonzero coefficients decays at a high rate. It is now well recognized that the $\ell_1$ penalization
of the likelihood often promotes sparsity under certain assumptions on the matrix $X$. We refer the
reader to the book [4] and the references therein for a state of the art presentation of the LASSO
and the tools involved in the theoretical analysis of its properties. One of the most interesting
properties of the LASSO estimator is that it is a solution of a convex optimization problem and can
be computed in polynomial time, i.e. very quickly in the sense of computational complexity theory.
This sets it apart from other approaches based on variable selection criteria like AIC [1], BIC [16],
Foster and George's Risk Inflation Criterion [11], etc., which are based on enumeration of the
possible models, or even from the recent proposals of Dalalyan, Rigollet and Tsybakov [14], [9],
although there enumeration is replaced with a practically more efficient Markov Chain Monte Carlo
algorithm.
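To make the convex program (1.2) concrete, here is a minimal sketch of a cyclic coordinate descent solver. It is an illustration only, not the algorithm used in the references; the function names `soft_threshold` and `lasso_cd` and the fixed number of sweeps are assumptions made for this example.

```python
import numpy as np

def soft_threshold(x, t):
    """Soft-thresholding operator: sign(x) * max(|x| - t, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for min_b 0.5*||y - X b||_2^2 + lam*||b||_1.

    Hypothetical helper: a fixed number of sweeps is used instead of a
    convergence test, which is enough for small illustrative problems.
    """
    n, p = X.shape
    b = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)      # squared column norms
    resid = y - X @ b                    # current residual y - X b
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # correlation of column j with the residual, with b_j removed
            rho = X[:, j] @ resid + col_sq[j] * b[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            resid += X[:, j] * (b[j] - new_bj)   # keep the residual in sync
            b[j] = new_bj
    return b
```

For an orthonormal design the minimizer is the soft-thresholded response, so `lasso_cd(np.eye(3), np.array([3.0, -0.5, 1.5]), 1.0)` is close to `[2.0, 0.0, 0.5]`.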

In the problem of estimating $X\beta$, i.e. the prediction problem, it is often believed that the
price to pay for reducing the variable selection approach to a convex optimization problem is a
certain set of assumptions on the design matrix $X$.¹ One of the main contributions of [14] is that
no particular assumption on $X$ is required for the prediction problem, as opposed to the known
results concerning the LASSO such as [3], [2], [6] and [19], and the many references cited in these
works.

An impressive amount of work has been done in recent years in order to understand the properties of
$\hat{\beta}$ under various assumptions on $X$. See the recent book by P. Bühlmann and S. van de
Geer [4] for the state of the art. Two well known assumptions on the design matrix are



'Conditions for model selection consistency are given in e.g. |21| . and for exact support and sign pattern recovery 
with finite samples and p 2> n, in [BJ, |2U] 
• small Coherence $\mu(X)$

• small Restricted Isometry Constant $\delta(X)$

where the Coherence $\mu(X)$ is defined as

$$ \mu(X) = \max_{j \neq j'} |X_j^t X_{j'}|, $$

and the Restricted Isometry Constant $\delta(X)$ is the smallest $\delta$ such that

$$ (1-\delta)\|\beta_T\|_2 \leq \|X_T \beta_T\|_2 \leq (1+\delta)\|\beta_T\|_2 \tag{1.3} $$

for all subsets $T$ with cardinal $s$ and all $\beta \in \mathbb{R}^p$. Other conditions are listed in
[5]; see Figure 1 in that paper for a diagram summarizing all relationships between them. The
Restricted Isometry Property is very stringent and implies almost all other conditions. Moreover, the
Restricted Isometry Constant is NP-hard to compute for general matrices. On the other hand, computing
the Coherence only requires of the order of $n\,p(p-1)$ elementary operations. Moreover, it was proved
in [6] that a small coherence, say of the order of $1/\log(p)$, is sufficient to prove a property very
close to the Restricted Isometry Property: (1.3) holds for a large proportion of subsets
$T \subset \{1,\ldots,p\}$, $|T| = s$ (of the order $1 - 1/p^{\alpha}$, $\alpha > 0$). This result was
later refined in [7] with better constants using the recently discovered Non-Commutative deviation
inequalities [18]. Less stringent properties are the restricted eigenvalue, the irrepresentable
and the compatibility properties.
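The operation count quoted above for the coherence can be seen directly from its definition: it is the largest off-diagonal entry, in absolute value, of the Gram matrix. The following sketch assumes unit-norm columns; the function name `coherence` is a choice made for this example.

```python
import numpy as np

def coherence(X):
    """mu(X) = max over j != j' of |<X_j, X_j'>|, for unit-norm columns."""
    gram = np.abs(X.T @ X)        # p x p matrix of absolute inner products
    np.fill_diagonal(gram, 0.0)   # discard the diagonal terms j = j'
    return float(gram.max())
```

Forming the Gram matrix costs of the order of $n p^2$ multiplications, in line with the estimate in the text.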

The goal of this short note is to show that, using a very simple trick, one can prove prediction
bounds similar to [6, Theorem 2.1] without any assumption on the design matrix $X$, at the low
expense of appending to $X$ a random matrix with independent columns uniformly distributed on the
sphere.

For this purpose, we introduce a new index for design matrices, denoted by $\gamma_{s,\rho_-}(X)$,
that allows one to obtain adaptive bounds on the prediction error without any assumption on the
smallest singular value of submatrices with $s$ columns. This index is defined for any $s \leq n$ and
$\rho_- \in (0,1)$ as

$$ \gamma_{s,\rho_-}(X) = \sup_{v \in B(0,1)} \ \inf_{I \in \mathcal{S}_{s,\rho_-}} \|X_I^t v\|_\infty, \tag{1.4} $$

where $\mathcal{S}_{s,\rho_-}(X)$ is the family of all index subsets $S$ of $\{1,\ldots,p\}$ with
cardinal $|S| = s$ such that $\sigma_{\min}(X_S) \geq \rho_-$. The meaning of the index
$\gamma_{s,\rho_-}$ is the following: for any $v \in \mathbb{R}^n$, we look for the "almost
orthogonal" family of $s$ columns of $X$ that is the most orthogonal to $v$.

One major advantage of this new parameter is that imposing the condition that $\gamma_{s,\rho_-}$ is
small is much less stringent than previous criteria required in the literature. In particular, many
submatrices of $X$ may be very badly conditioned or even singular without altering the smallness of
$\gamma_{s,\rho_-}$. Computing the new index $\gamma_{s,\rho_-}(X)$ for random matrices with
independent columns uniformly distributed on the sphere² shows that a prediction bound involving
$\gamma_{s,\rho_-}(X)$ can be obtained which is of the same order as the bound of [6, Theorem 2.1].
Then, using the trick of appending, to any design matrix, a random matrix with independent columns
uniformly drawn on the sphere, which does not deteriorate the index $\gamma_{s,\rho_-}$, we extend
our prediction bound to any design matrix.

The plan of the paper is as follows. In Section 2 we present the index $\gamma_{s,\rho_-}$ for $X$
and provide an upper bound on this index for random matrices with independent columns uniformly
distributed on the sphere, holding with high probability. Then, we present our prediction bound in
Theorem 2.4: we give a bound on the prediction squared error $\|X(\hat{\beta} - \beta)\|_2^2$ which
depends linearly on $s$. This result is similar in spirit to [6, Theorem 1.2]. The proofs of the above
results are given in Section 3. In Section 4 we show how these results can be applied in practice to
any problem with a matrix for which $\gamma_{s,\rho_-}$ is unknown by appending to $X$ an
$n \times p_0$ random matrix with i.i.d. columns uniformly distributed on


the unit sphere of M. n and with only po = 0{n^ p + ) columns. An appendix contains the proof of some 
intermediate results. 

1.1. Notations and preliminary assumptions. A vector $\beta$ in $\mathbb{R}^p$ is said to be
$s$-sparse if exactly $s$ of its components are different from zero. Let $\rho_-$ be a positive real
number. In the sequel, we will denote by $\mathcal{S}_{s,\rho_-}(X)$ the family of all index subsets
$S$ of $\{1,\ldots,p\}$ with cardinal $|S| = s$ such that $\sigma_{\min}(X_S) \geq \rho_-$.



² Or equivalently, post-normalized Gaussian i.i.d. matrices with components following $\mathcal{N}(0, 1/n)$.
2. Main results

2.1. A new index for design matrices. 

Definition 2.1. The index $\gamma_{s,\rho_-}(X)$ associated to the matrix $X$ is defined by

$$ \gamma_{s,\rho_-}(X) = \sup_{v \in B(0,1)} \ \inf_{I \in \mathcal{S}_{s,\rho_-}} \|X_I^t v\|_\infty. \tag{2.5} $$

Note that the function $s \mapsto \gamma_{s,\rho_-}(X)$ is clearly nondecreasing. An important fact
is that the function $X \mapsto \gamma_{s,\rho_-}(X)$ is nonincreasing in the sense that if
$X' = [X, x]$, where $x$ is a column vector in $\mathbb{R}^n$, then

$$ \gamma_{s,\rho_-}(X) \geq \gamma_{s,\rho_-}(X'). $$
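The definition and the monotonicity property can be explored numerically on very small matrices. The sketch below enumerates the family $\mathcal{S}_{s,\rho_-}$ by brute force, which is only feasible for tiny $p$, and replaces the supremum over the unit ball by a maximum over randomly sampled unit vectors, so it returns a lower bound on the index rather than the index itself. All function names are hypothetical.

```python
import itertools
import numpy as np

def admissible_family(X, s, rho):
    """Brute-force list of index sets S, |S| = s, with sigma_min(X_S) >= rho."""
    family = []
    for S in itertools.combinations(range(X.shape[1]), s):
        # singular values are returned in decreasing order
        sigma_min = np.linalg.svd(X[:, list(S)], compute_uv=False)[-1]
        if sigma_min >= rho:
            family.append(list(S))
    return family

def inner_inf(X, v, family):
    """inf over I in the family of ||X_I^t v||_inf, for one fixed v."""
    corr = np.abs(X.T @ v)
    return min((corr[I].max() for I in family), default=np.inf)

def gamma_lower_bound(X, s, rho, n_dirs=500, seed=0):
    """Monte Carlo lower bound on the index: max of inner_inf over random unit v."""
    rng = np.random.default_rng(seed)
    family = admissible_family(X, s, rho)
    best = 0.0
    for _ in range(n_dirs):
        v = rng.standard_normal(X.shape[0])
        v /= np.linalg.norm(v)
        best = max(best, inner_inf(X, v, family))
    return best
```

Appending a column can only enlarge the admissible family, so for any fixed $v$ the value returned by `inner_inf` cannot increase, which is the monotonicity stated above.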

Unlike the coherence $\mu(X)$, for fixed $n$ and $s$, the quantity $\gamma_{s,\rho_-}(X)$ is very
small for $p$ sufficiently large, at least for random matrices such as normalized Gaussian matrices,
as shown by the following proposition.

Proposition 2.2. Assume that $X$ is a random matrix in $\mathbb{R}^{n \times p}$ with i.i.d. columns
with uniform distribution on the unit sphere of $\mathbb{R}^n$. Let $\rho_-$ and $\epsilon \in (0,1)$,
$C_\kappa \in (0,+\infty)$ and $p_0 \in$ {[e^],...,p}.
Set



K £ = ^((A + C K )\og(l + -\+C K + \og(^- 



(2-6) * = 7CT f (1+ ^S CK) ) log 2 (p )log(C K n), 



Assume that n, k and s satisfy 



max{Ks,2 x 36 x 3 x 3,exp((l - p_)/2)) 
(2.7) — < n < mm < 




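Then, we have

$$ \gamma_{s,\rho_-}(X) \leq \sqrt{2\pi}\ \frac{\kappa s + \frac{2}{3} n \log(p_0) + \sqrt{2n\,\kappa s \log(p_0) + \left(\frac{2}{3} n \log(p_0)\right)^2}}{\sqrt{n}\ p_0} \tag{2.8} $$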

to£/i probability at least 1 — 8p n — p n — 20 n 2 p • 

Remark 2.3. Notice that the constraints 112. b)) and \2.7\) together imply the following constraint on s: 



S < c, " 



JSVarSltV log 2 (p )log(C K n) 
with 

c\\-pJf{\-ef C K 



a 



sparsity 



4e3 (1 + A e )2(l + C K ) 



2.2. A bound of $\|X(\hat{\beta} - \beta)\|_2^2$ based on $\gamma_{s,\rho_-}(X)$. In the remainder
of this paper, we will assume that the columns of $X$ are $\ell_2$-normalized. The main result of
this paper is the following theorem.

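Theorem 2.4. Let $\rho_- \in (0,1)$. Let $\nu$ be a positive real such that

$$ \nu n\ \gamma_{\nu n,\rho_-}(X) < \frac{\rho_-\ \sigma_{\min}(X_S)}{\max_{T \subset \{1,\ldots,p\},\ |T| \leq n} \sigma_{\max}(X_T)}. \tag{2.9} $$

Assume that $s \leq \nu n$. Assume that $\beta$ has support $S$ with cardinal $s$ and that

$$ \lambda \geq \sigma \left( B_{X,\nu,\rho_-} \max_{T \subset \{1,\ldots,p\},\ |T| \leq n} \sigma_{\max}(X_T) \sqrt{2\alpha \log(p) + \log(2\nu n)} + \sqrt{(2\alpha+1)\log(p) + \log(2)} \right) \tag{2.10} $$

with

$$ B_{X,\nu,\rho_-} = \frac{\nu n\ \gamma_{\nu n,\rho_-}(X)}{\rho_-\ \sigma_{\min}(X_S) - \nu n\ \gamma_{\nu n,\rho_-}(X) \max_{T \subset \{1,\ldots,p\},\ |T| \leq n} \sigma_{\max}(X_T)}. \tag{2.11} $$

Then, with probability greater than $1 - p^{-\alpha}$, we have

$$ \frac{1}{2}\|X(\hat{\beta} - \beta)\|_2^2 \leq s\ C_{n,p,\rho_-,\alpha,\nu,\lambda} \tag{2.12} $$

with

$$ C_{n,p,\rho_-,\alpha,\nu,\lambda} = \frac{\lambda + \sigma\sqrt{(2\alpha+1)\log(p) + \log(2)}}{\rho_-\ \sigma_{\min}(X_S)} \left( \sigma\sqrt{2\alpha \log(p) + \log(2\nu n)} + \lambda \right). \tag{2.13} $$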

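2.3. Comments. Equation (2.9) in Theorem 2.4 requires that

$$ \gamma_{\nu n,\rho_-}(X) < \rho_-\ \frac{\sigma_{\min}(X_S)}{\nu n \max_{T \subset \{1,\ldots,p\},\ |T| \leq n} \sigma_{\max}(X_T)}. \tag{2.14} $$

Proposition 2.2 proves that for random matrices with independent columns uniformly drawn on the
unit sphere of $\mathbb{R}^n$ (i.e. normalized i.i.d. Gaussian matrices),

$$ \gamma_{s,\rho_-}(X) \leq \sqrt{2\pi}\ \frac{\kappa s + \frac{2}{3} n \log(p_0) + \sqrt{2n\,\kappa s \log(p_0) + \left(\frac{2}{3} n \log(p_0)\right)^2}}{\sqrt{n}\ p_0} \tag{2.15} $$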

with probability at least $1 - 9p_0^{-n}$. Thus, taking $p_0$ of the order nA will ensure that (2.14)
is satisfied with large probability for $s$ up to the order $n/\log(p_0)$, in the case of normalized
Gaussian matrices at least. The case of general design matrices will be studied in Section 4.

The main advantage of using the parameter $\gamma_{\nu n,\rho_-}(X)$ is that it allows $X$ to contain
extremely badly conditioned submatrices, a situation that may often occur in practice when certain
covariates are highly correlated. This is in contrast with the Restricted Isometry Property, the
Incoherence condition, or other conditions often required in the literature. On the other hand, the
parameter $\gamma_{\nu n,\rho_-}(X)$ is not easily computable. We will see however in Section 4 how
to circumvent this problem in practice by the simple trick consisting of appending a random matrix
with $p_0$ columns to the matrix $X$ in order to ensure that the resulting matrix satisfies (2.14)
with high probability.

Finally, notice that unlike in [6, Theorem 2.1], we make no assumption on the sign pattern of
$\beta$. In particular, we do not require the sign pattern of the nonzero components to be random.
Moreover, the extreme singular values of $X_S$ are not required to be independent of $n$ or $p$, and
the condition (2.14) is satisfied for a wide range of configurations of the various parameters
involved in the problem.



3. Proofs 
3.1. Proof of Proposition 2.2.

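3.1.1. Constructing an outer approximation for $I$ in the definition of $\gamma_{s,\rho_-}$. Take
$v \in \mathbb{R}^n$. We construct an outer approximation $\tilde{I}$ of $I$ from which we will be
able to extract the set $I$. We proceed recursively as follows: until
$|\tilde{I}| = \min\{\kappa s, p_0/2\}$, for some positive real number $\kappa$ to be specified
later, do

• Choose $j_1 = \operatorname{argmin}_{j=1,\ldots,p_0} |\langle X_j, v\rangle|$ and set $\tilde{I} = \{j_1\}$

• Choose $j_2 = \operatorname{argmin}_{j=1,\ldots,p_0} \max\{|\langle X_j, v\rangle|, |\langle X_j, X_{j_1}\rangle|\}$ and set $\tilde{I} = \tilde{I} \cup \{j_2\}$

• $\cdots$

• Choose $j_k = \operatorname{argmin}_{j=1,\ldots,p_0} \max\{|\langle X_j, v\rangle|, |\langle X_j, X_{j_1}\rangle|, \ldots, |\langle X_j, X_{j_{k-1}}\rangle|\}$ and set $\tilde{I} = \tilde{I} \cup \{j_k\}$.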
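The recursive construction above can be sketched as a greedy loop. This is an illustration under one explicit assumption that the text leaves implicit: indices already selected are excluded from later steps. The function name `greedy_outer_set` and the argument `size` (standing for $\min\{\kappa s, p_0/2\}$) are hypothetical.

```python
import numpy as np

def greedy_outer_set(X, v, size):
    """Greedy selection: at each step pick the unused column j minimizing
    max(|<X_j, v>|, |<X_j, X_i>| for all i selected so far)."""
    p = X.shape[1]
    score = np.abs(X.T @ v)              # running max over v and selected columns
    available = np.ones(p, dtype=bool)   # assumption: no index is picked twice
    selected = []
    for _ in range(min(size, p)):
        masked = np.where(available, score, np.inf)
        j = int(np.argmin(masked))
        selected.append(j)
        available[j] = False
        # fold the correlations with the newly selected column into the score
        score = np.maximum(score, np.abs(X.T @ X[:, j]))
    return selected
```

Keeping a running maximum in `score` avoids recomputing all pairwise inner products at every step, so each step costs one matrix-vector product.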
3.1.2. An upper bound on $\|X_{\tilde{I}}^t v\|_\infty$. If we denote by $Z_j$ the quantity
$|\langle X_j, v\rangle|$ and by $Z_{(r)}$ the $r$-th order statistic, we get that

$$ \|X_{\tilde{I}}^t v\|_\infty = Z_{(\kappa s)}. $$

Since the $X_j$'s are assumed to be i.i.d. with uniform distribution on the unit sphere of
$\mathbb{R}^n$, we obtain that the distribution of $Z_{(r)}$ is the distribution of the $r$-th order
statistic of the sequence $|X_j^t v|$, $j = 1,\ldots,p_0$. By (5) p. 147 of [12], $X_j^t v$ has
density $g$ and $|X_j^t v|$ has CDF $G$ given by

$$ g(z) = \frac{1}{\sqrt{\pi}} \frac{\Gamma\left(\frac{n}{2}\right)}{\Gamma\left(\frac{n-1}{2}\right)} (1 - z^2)^{\frac{n-3}{2}} \quad \text{and} \quad G(z) = 2\int_0^z g(\zeta)\, d\zeta. $$

Thus,

$$ F_{Z_{(r)}}(z) = \mathbb{P}(B \geq r), $$

where $B$ is a binomial variable $\mathcal{B}(p_0, G(z))$. Our next goal is to find a sufficiently
small value $z_0$ of $z$ satisfying

(3-16) Fz (Ks) (z) > pv n . 
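The identity between the CDF of an order statistic and a binomial tail can be evaluated directly for small $p_0$. The sketch below takes the value $G(z)$ as an input rather than computing it; the function name `order_stat_cdf` is hypothetical.

```python
from math import comb

def order_stat_cdf(r, p0, G):
    """P(Z_(r) <= z) = P(B >= r) with B ~ Binomial(p0, G), where G = G(z)."""
    return sum(comb(p0, k) * G ** k * (1.0 - G) ** (p0 - k)
               for k in range(r, p0 + 1))
```

Two sanity checks follow from the formula: for $r = p_0$ (the maximum) the result is $G^{p_0}$, and for $r = 1$ (the minimum) it is $1 - (1-G)^{p_0}$.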

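By the standard Chernoff concentration inequality, we have that

$$ \mathbb{P}\left(B \geq p_0 G(z) + \delta\right) \leq \exp\left( - \frac{\delta^2}{2\left(p_0 G(z) + \frac{\delta}{3}\right)} \right). $$

Thus, taking

$$ \kappa s = p_0 G(z) + \delta $$

and

$$ \frac{\delta^2}{2\left(p_0 G(z) + \frac{\delta}{3}\right)} \geq n \log(p_0) $$

will ensure that the probability bound (3.16) is satisfied. Combining these two relations, we obtain
that

$$ \frac{3\left(\kappa s - p_0 G(z)\right)^2}{2\left(2 p_0 G(z) + \kappa s\right)} \geq n \log(p_0). $$

This gives

$$ \left(\kappa s - p_0 G(z)\right)^2 \geq \frac{2}{3}\, n \log(p_0) \left(2 p_0 G(z) + \kappa s\right), $$

that is,

$$ \kappa^2 s^2 - 2\kappa s\, p_0 G(z) + p_0^2 G(z)^2 \geq \frac{4}{3}\, n \log(p_0)\, p_0 G(z) + \frac{2}{3}\, n\, \kappa s \log(p_0), $$

or equivalently

$$ \kappa s \left(\kappa s - \frac{2}{3}\, n \log(p_0)\right) - 2\left(\kappa s + \frac{2}{3}\, n \log(p_0)\right) p_0 G(z) + p_0^2 G(z)^2 \geq 0, $$

which is satisfied as soon as

$$ G(z) \geq \frac{\kappa s + \frac{2}{3} n \log(p_0) + \sqrt{2n\,\kappa s \log(p_0) + \left(\frac{2}{3} n \log(p_0)\right)^2}}{p_0}. $$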
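The last step solves a quadratic inequality in $t = p_0 G(z)$, and the closed-form root can be checked numerically. In the sketch below, `ks` stands for the product $\kappa s$; the function names are hypothetical and the check only covers the algebra of this step.

```python
from math import log, sqrt

def quadratic_lhs(t, ks, n, p0):
    """ks*(ks - a) - 2*(ks + a)*t + t^2, with a = (2/3)*n*log(p0) and t = p0*G(z)."""
    a = (2.0 / 3.0) * n * log(p0)
    return ks * (ks - a) - 2.0 * (ks + a) * t + t * t

def upper_root(ks, n, p0):
    """Larger root in t: ks + a + sqrt(2*n*ks*log(p0) + a^2)."""
    a = (2.0 / 3.0) * n * log(p0)
    return ks + a + sqrt(2.0 * n * ks * log(p0) + a * a)
```

The discriminant term comes from $(\kappa s + a)^2 - \kappa s(\kappa s - a) = 3a\,\kappa s + a^2$, which equals $2n\,\kappa s \log(p_0) + a^2$ when $a = \frac{2}{3} n \log(p_0)$.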

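Notice that

$$ G(z) = \frac{2}{\sqrt{\pi}} \frac{\Gamma\left(\frac{n}{2}\right)}{\Gamma\left(\frac{n-1}{2}\right)} \int_0^z (1 - \zeta^2)^{\frac{n-3}{2}}\, d\zeta \leq \frac{2}{\sqrt{\pi}} \frac{\Gamma\left(\frac{n}{2}\right)}{\Gamma\left(\frac{n-1}{2}\right)}\ z, $$

and since, by Garland's upper bound (see e.g. (2.8) in [13]),

$$ \frac{\Gamma\left(\frac{n}{2}\right)}{\Gamma\left(\frac{n-1}{2}\right)} \leq \frac{n-1}{\sqrt{2n-1}}, $$

we obtain

$$ G(z) \leq \frac{2}{\sqrt{\pi}}\ \frac{n-1}{\sqrt{2n-1}}\ z \leq \sqrt{\frac{2n}{\pi}}\ z. $$

Therefore, the choice

$$ z_0 = \sqrt{2\pi}\ \frac{\kappa s + \frac{2}{3} n \log(p_0) + \sqrt{2n\,\kappa s \log(p_0) + \left(\frac{2}{3} n \log(p_0)\right)^2}}{2 \sqrt{n}\ p_0} $$

is an upper bound to the quantile for $\kappa s$-order statistics at level $p_0^{-n}$ and we obtain
that

$$ \mathbb{P}\left( \|X_{\tilde{I}}^t v\|_\infty \geq z_0 \right) \leq p_0^{-n}. \tag{3.17} $$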

3.1.3. Extracting a well conditioned submatrix of $X_{\tilde{I}}$. The method for extracting $X_I$
from $X_{\tilde{I}}$ uses random column selection. For this purpose, we will need to control the
coherence and the norm of $X_{\tilde{I}}$.
Step 1: The coherence of $X_{\tilde{I}}$. Let us define the spherical cap

$$ C(v,h) = \{ w \in \mathbb{R}^n \mid \langle v, w \rangle \geq h \}. $$

The area of C(v, h) is given by 

r2h-h 2 _j l 

Area(C(v,h)) = Area(S(0,l)) / t^(l-t)^dt. 

Jo 

Thus, the probability that a random vector w with Haar measure on the unit sphere 5(0, 1) falls into 

the spherical cap C(v, h) is given by 

C[v,h) 



(w eC(v,h)) 



5(0,1) 
/o 



Jo * 2 {l-t)*dt 



"1 n ~ 1 



£ t — (l-t)adt 

The last term is the CDF of the Beta distribution. Using the fact that 

F(X j eC(X f ,h)) = F(X f eC(Xj,h)) 

the union bound, and the independence of the Xj's, the probability that Xj G C(Xj/,h) for some 
(j,j') in {1, . . . ,po} can be bounded as follows 

f (ugy, =1 {x, e C(x /5 />)}) = f (u? <j/=1 {x 3 e C(x /; /»)}) 



Ml 



< E p ({^ e %»}) 



po 



£ E [F({X 3 €C(X f ,h)}\X f )} 

3<j'=l 

Po(Po-l) / ,2h_h2 »-i, 



Our next task is to choose /i so that 



2 jo 

Let us make the following crude approximation 



^<-- l) ^'tVd-t)** < tf 



po(po-1) f"\^ ^ . Po\ 



2 
Thus, taking 



t-zn-n- v _ 

/ tV(l-t)adt < ^(2/i)V(2/i-0). 



, . 1 / A , x , log(po)-log(2)) 

n > - exp -2 log(po) + 



2 ' \ V n + 1 
will work. Moreover, since $p_0 \geq 2$, we deduce that

(3-18) n(X f ) < ^pf 

with probability at least 1 — p^ n . 

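Step 2: The norm of $X_{\tilde{I}}$. The norm of any submatrix $X_S$ of $X$ with $n$ rows and
$\kappa s$ columns has the following variational representation

$$ \|X_S\| = \max_{v \in \mathbb{R}^n,\ \|v\| = 1,\ w \in \mathbb{R}^{\kappa s},\ \|w\| = 1} v^t X_S w. $$

We will use an easy $\epsilon$-net argument to control this norm. For any unit-norm
$v \in \mathbb{R}^n$, $v^t X_j$, $j \in S$, is a sub-Gaussian random variable satisfying

$$ \mathbb{P}\left( |v^t X_j| \geq u \right) \leq 2 \exp\left( -c\, n\, u^2 \right), $$

for some constant $c$. Therefore, using the fact that $\|w\| = 1$, we have that

$$ \mathbb{P}\left( |v^t X_S w| \geq u \right) \leq 2 \exp\left( -c\, n\, u^2 \right). $$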

Let us recall two useful results of Rudelson and Vershynin. The first one gives a bound on the
covering number of spheres.

Proposition 3.1. ([15, Proposition 2.1]). For any positive integer $d$, there exists an
$\epsilon$-net of the unit sphere of $\mathbb{R}^d$ of cardinality

$$ 2d \left(1 + \frac{2}{\epsilon}\right)^{d-1}. $$

The second controls the approximation of the norm based on an $\epsilon$-net.

Proposition 3.2. ([15, Proposition 2.2]). Let $\mathcal{N}$ be an $\epsilon$-net of the unit sphere
of $\mathbb{R}^d$ and let $\mathcal{N}'$ be an $\epsilon'$-net of the unit sphere of
$\mathbb{R}^{d'}$. Then for any linear operator $A : \mathbb{R}^{d'} \to \mathbb{R}^d$, we have

$$ \|A\| \leq \frac{1}{(1-\epsilon)(1-\epsilon')} \sup_{v \in \mathcal{N},\ w \in \mathcal{N}'} |v^t A w|. $$

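Let $\mathcal{N}$ (resp. $\mathcal{N}'$) be an $\epsilon$-net of the unit sphere of
$\mathbb{R}^{\kappa s}$ (resp. of $\mathbb{R}^n$). On the other hand, we have that

$$ \mathbb{P}\left( \sup_{v \in \mathcal{N}',\ w \in \mathcal{N}} |v^t X_S w| \geq u \right) \leq 2 |\mathcal{N}| |\mathcal{N}'| \exp\left( -c\, n\, u^2 \right) \leq 8\, n \kappa s \left(1 + \frac{2}{\epsilon}\right)^{n + \kappa s - 2} \exp\left( -c\, n\, u^2 \right), $$

which gives

$$ \mathbb{P}\left( \sup_{v \in \mathcal{N}',\ w \in \mathcal{N}} |v^t X_S w| \geq u \right) \leq 8\, \frac{n \kappa s\, \epsilon^2}{(2+\epsilon)^2} \exp\left( - \left( c\, n\, u^2 - (n + \kappa s) \log\left(1 + \frac{2}{\epsilon}\right) \right) \right). $$

Using Proposition 3.2, we obtain that

$$ \mathbb{P}\left( \|X_S\| \geq u \right) \leq \mathbb{P}\left( \frac{1}{(1-\epsilon)^2} \sup_{v \in \mathcal{N}',\ w \in \mathcal{N}} |v^t X_S w| \geq u \right). $$

Thus, we obtain

$$ \mathbb{P}\left( \|X_S\| \geq u \right) \leq 8\, \frac{n \kappa s\, \epsilon^2}{(2+\epsilon)^2} \exp\left( - \left( c\, n (1-\epsilon)^4 u^2 - (n + \kappa s) \log\left(1 + \frac{2}{\epsilon}\right) \right) \right). $$

To conclude, let us note that

$$ \mathbb{P}\left( \|X_{\tilde{I}}\| \geq u \right) \leq \mathbb{P}\left( \max_{S \subset \{1,\ldots,p_0\},\ |S| = \kappa s} \|X_S\| \geq u \right) \leq \binom{p_0}{\kappa s}\ 8\, \frac{n \kappa s\, \epsilon^2}{(2+\epsilon)^2} \exp\left( - \left( c\, n (1-\epsilon)^4 u^2 - (n + \kappa s) \log\left(1 + \frac{2}{\epsilon}\right) \right) \right), $$

and using the fact that

$$ \binom{p_0}{\kappa s} \leq \left( \frac{e\, p_0}{\kappa s} \right)^{\kappa s}, $$

one finally obtains

$$ \mathbb{P}\left( \|X_{\tilde{I}}\| \geq u \right) \leq 8 \exp\left( - \left( c\, n (1-\epsilon)^4 u^2 - (n + \kappa s) \log\left(1 + \frac{2}{\epsilon}\right) - \kappa s \log\left( \frac{e\, p_0}{\kappa s} \right) - \log\left( \frac{n \kappa s\, \epsilon^2}{(2+\epsilon)^2} \right) \right) \right). $$

The right hand side term will be less than $8 p_0^{-n}$ when

$$ n \log(p_0) \leq c\, n (1-\epsilon)^4 u^2 - (n + \kappa s) \log\left(1 + \frac{2}{\epsilon}\right) - \kappa s \log\left( \frac{e\, p_0}{\kappa s} \right) - \log\left( \frac{n \kappa s\, \epsilon^2}{(2+\epsilon)^2} \right). $$

This happens if

$$ u^2 \geq \frac{1}{c (1-\epsilon)^4} \left( \log(p_0) + \left(1 + \frac{\kappa s}{n}\right) \log\left(1 + \frac{2}{\epsilon}\right) + \frac{\kappa s}{n} \log\left( \frac{e\, p_0}{\kappa s} \right) + \frac{1}{n} \log\left( \frac{n \kappa s\, \epsilon^2}{(2+\epsilon)^2} \right) \right). $$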

Notice that 

, n .^n /, ks\ , / 2\ ks , / e \ 1, / nKS e 2 

(3.19) i + _ bg l + _ +_log — +-loj 



nJ \ ej n \ks/ n \(2 + e) 

< (1 + C K ) log (l + |) + C K + I log /an ' 



<^ £ 



6 



since n > 1. Now, since 



we finally obtain 



6 n + ks 

< log(jpo) < log(po), 



27r n 



™ A, „ ,, 1 + iO n + ks , . ,\ 8 

(3-20) P^,|>- ?I -^-_logta,)j < -. 

Step 3. We will use the following lemma on the distance to identity of randomly selected
submatrices.

Lemma 3.3. Let $r \in (0,1)$. Let $n$, $\kappa$ and $s$ satisfy conditions (2.7) and (2.6) assumed in
Proposition 2.2. Let $\Sigma \subset \{1,\ldots,\kappa s\}$ be a random support with uniform
distribution on index sets with cardinal $s$. Then, with probability greater than or equal to
$1 - 8p_0^{-n} - p_0^{-n}$ on $X$, the following bound holds:

$$ \mathbb{P}\left( \|X_\Sigma^t X_\Sigma - \mathrm{Id}_s\| \geq r \mid X \right) < 1. \tag{3.21} $$

Proof. See Appendix. □

Taking $r = 1 - \rho_-$, we conclude from Lemma 3.3 that, for any $s$ satisfying (2.9), there exists
a subset $I$ of $\tilde{I}$ with cardinal $s$ such that

$$ \sigma_{\min}(X_I) \geq \rho_-. $$
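3.1.4. The supremum over an $\epsilon$-net. Recalling Proposition 3.1, there exists an
$\epsilon$-net $\mathcal{N}$ covering the unit sphere in $\mathbb{R}^n$ with cardinal

$$ |\mathcal{N}| \leq 2n \left(1 + \frac{2}{\epsilon}\right)^{n-1}. $$

Combining this with (3.17), we have that

$$ \mathbb{P}\left( \sup_{v \in \mathcal{N}} \inf_{I \in \mathcal{S}_{s,\rho_-}} \|X_I^t v\|_\infty \geq \sqrt{2\pi}\ \frac{\kappa s + \frac{2}{3} n \log(p_0) + \sqrt{2n\,\kappa s \log(p_0) + \left(\frac{2}{3} n \log(p_0)\right)^2}}{2\sqrt{n}\ p_0} \right) \leq 2n \left(1 + \frac{2}{\epsilon}\right)^{n-1} (1 + 1 + 8)\, p_0^{-n}. \tag{3.22} $$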

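3.1.5. From the $\epsilon$-net to the whole sphere. For any $v'$, one can find $v \in \mathcal{N}$
with $\|v' - v\|_2 \leq \epsilon$. Thus, we have

$$ \begin{aligned} \|X_I^t v'\|_\infty &\leq \|X_I^t v\|_\infty + \|X_I^t (v' - v)\|_\infty \\ &\leq \|X_I^t v\|_\infty + \max_{j \in I} |\langle X_j, v' - v\rangle| \\ &\leq \|X_I^t v\|_\infty + \max_{j \in I} \|X_j\|_2 \|v' - v\|_2 \\ &\leq \|X_I^t v\|_\infty + \epsilon. \end{aligned} \tag{3.23} $$

Taking

$$ \epsilon = \sqrt{2\pi}\ \frac{\kappa s + \frac{2}{3} n \log(p_0) + \sqrt{2n\,\kappa s \log(p_0) + \left(\frac{2}{3} n \log(p_0)\right)^2}}{2\sqrt{n}\ p_0}, $$

we obtain from (3.23) and (3.22) that

$$ \mathbb{P}\left( \sup_{\|v\|_2 = 1} \inf_{I \in \mathcal{S}_{s,\rho_-}} \|X_I^t v\|_\infty \geq \sqrt{2\pi}\ \frac{\kappa s + \frac{2}{3} n \log(p_0) + \sqrt{2n\,\kappa s \log(p_0) + \left(\frac{2}{3} n \log(p_0)\right)^2}}{\sqrt{n}\ p_0} \right) \leq 20\, n \left(1 + \frac{2}{\epsilon}\right)^{n-1} p_0^{-n}. $$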
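The passage from the net to the whole sphere rests on the chain of inequalities (3.23), which only uses the unit norm of the columns. A minimal numerical check, with a hypothetical helper `sup_norm_gap`, could look as follows.

```python
import numpy as np

def sup_norm_gap(X_I, v_prime, v):
    """Return (lhs, rhs) of ||X_I^t v'||_inf <= ||X_I^t v||_inf + ||v' - v||_2.

    The inequality is valid when the columns of X_I have unit l2-norm.
    """
    lhs = np.max(np.abs(X_I.T @ v_prime))
    rhs = np.max(np.abs(X_I.T @ v)) + np.linalg.norm(v_prime - v)
    return lhs, rhs
```

With $\|v' - v\|_2 \leq \epsilon$ the right hand side is at most $\|X_I^t v\|_\infty + \epsilon$, as in the last line of (3.23).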
On the other hand (after omitting the terms involving the variable s), we have that 

> 1 /iL V^ lo s(^o) 



3V 2 po 

and we obtain that 



. r II xW 11 ^ ,7— KS+|n log(p )+,/2?i Kslog(p )+(§n log(po)) 2 

sup||„|| 2 =i inf /c5sip _ \\X}v\\ > V2tt VEV 
Since Po/l°g(Po) > y/n an d log(po) > "T^j we thus have



■ r \\vt II ^ /7T- KS +l n l°g(Po)+\/2n Kslog(p )+(|n log(p )) 2 

sup||„|| 2=1 mi IcSsp _ \\X}v\\ > V2vr * 



< 20 n ($. j Po -«. 
This gives 



n-l 



m / . . || vt || . nr- KS+^n Iog(p )+J2n Kslog(p )+(|n log(p )) 2 

P I su P ||„|| 2=1 inf /c5siP _ ||X*V|| > V^ ^^^ 

-n+3 1 

< 20 ra^T" Pq 1 
as announced. 
3.2. Proof of Theorem 2.4.

3.2.1. Optimality conditions. The optimality conditions for the LASSO are given by

$$ -X^t (y - X\hat{\beta}) + \lambda g = 0 \tag{3.24} $$

for some $g \in \partial(\|\cdot\|_1)_{\hat{\beta}}$. Thus, we have

$$ X^t X (\hat{\beta} - \beta) = X^t \varepsilon - \lambda g, \tag{3.25} $$

from which one obtains that, for any index set $I \subset \{1,\ldots,p\}$ with cardinal $s$,

$$ \|X_I^t X (\hat{\beta} - \beta)\|_\infty \leq \lambda + \|X_I^t \varepsilon\|_\infty. \tag{3.26} $$
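The optimality conditions can be checked numerically for a candidate solution: on the support the correlation $X_j^t(y - X\hat{\beta})$ must equal $\lambda\, \mathrm{sign}(\hat{\beta}_j)$, and off the support its absolute value must not exceed $\lambda$. The helper `kkt_violation` below is a hypothetical sketch, with a tolerance `tol` chosen arbitrarily to decide which coefficients count as nonzero.

```python
import numpy as np

def kkt_violation(X, y, b, lam, tol=1e-10):
    """Largest violation of X^t (y - X b) = lam * g, with g in the
    subdifferential of the l1-norm at b."""
    corr = X.T @ (y - X @ b)
    on = np.abs(b) > tol
    # on the support: corr_j must equal lam * sign(b_j)
    v_on = np.abs(corr[on] - lam * np.sign(b[on]))
    # off the support: |corr_j| must not exceed lam
    v_off = np.maximum(np.abs(corr[~on]) - lam, 0.0)
    return float(max(v_on.max(initial=0.0), v_off.max(initial=0.0)))
```

A return value close to zero indicates that `b` satisfies (3.24) up to numerical error; a large value means `b` is not a LASSO solution for this `lam`.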



3.2.2. The support of $\hat{\beta}$. As is well known, even when the solution of the LASSO
optimization problem is not unique, there always exists a solution $\hat{\beta}$ whose support has
cardinal at most $n$.

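3.2.3. A bound on $\|X_I^t X_S(\hat{\beta}_S - \beta_S)\|_\infty$. The argument is divided into three
steps.
First step. Equation (3.26) implies that

$$ \|X_I^t X_S(\hat{\beta}_S - \beta_S)\|_\infty \leq \lambda + \|X_I^t \varepsilon\|_\infty + \|X_I^t X_{\hat{S} \cap S^c}(\hat{\beta}_{\hat{S} \cap S^c} - \beta_{\hat{S} \cap S^c})\|_\infty, \tag{3.27} $$

where $\hat{S}$ denotes the support of $\hat{\beta}$.
Second step. We now choose $I$ as a solution of the following problem

$$ \vartheta = \min_{I \subset \{1,\ldots,p\},\ |I| = s}\ \max_{j \in I} \left| \langle X_j, X_{\hat{S} \cap S^c}(\hat{\beta}_{\hat{S} \cap S^c} - \beta_{\hat{S} \cap S^c}) \rangle \right| $$

subject to

$$ \sigma_{\min}(X_I) \geq \rho_-. $$

By Definition 2.1,

$$ \vartheta \leq \gamma_{s,\rho_-}(X)\ \|X_{\hat{S} \cap S^c}(\hat{\beta}_{\hat{S} \cap S^c} - \beta_{\hat{S} \cap S^c})\|_2 $$

and thus,

$$ \begin{aligned} \vartheta &\leq \gamma_{s,\rho_-}(X)\ \sigma_{\max}(X_{\hat{S} \cap S^c})\ \|\hat{\beta}_{\hat{S} \cap S^c} - \beta_{\hat{S} \cap S^c}\|_2 \\ &\leq \gamma_{s,\rho_-}(X)\ \sigma_{\max}(X_{\hat{S} \cap S^c})\ \|\hat{\beta}_{\hat{S} \cap S^c} - \beta_{\hat{S} \cap S^c}\|_1, \end{aligned} $$

which gives

$$ \vartheta \leq \gamma_{s,\rho_-}(X)\ \sigma_{\max}(X_{\hat{S} \cap S^c}) \left( \|\hat{\beta}_{\hat{S} \cap S^c}\|_1 + \|\beta_{\hat{S} \cap S^c}\|_1 \right). \tag{3.28} $$

Third step. Combining (3.27) and (3.28), we obtain

$$ \|X_I^t X_S(\hat{\beta}_S - \beta_S)\|_\infty \leq \lambda + \|X_I^t \varepsilon\|_\infty + \gamma_{s,\rho_-}(X)\ \sigma_{\max}(X_{\hat{S} \cap S^c}) \left( \|\hat{\beta}_{\hat{S} \cap S^c}\|_1 + \|\beta_{\hat{S} \cap S^c}\|_1 \right). $$

Using the fact that

$$ \|X_I^t \varepsilon\|_\infty \leq \|X^t \varepsilon\|_\infty \tag{3.29} $$

and since

$$ \mathbb{P}\left( \|X^t \varepsilon\|_\infty \geq \sigma \sqrt{2\alpha \log(p) + \log(2p)} \right) \leq p^{-\alpha}, $$

we obtain that

$$ \|X_I^t X_S(\hat{\beta}_S - \beta_S)\|_\infty \leq \lambda + \sigma \sqrt{(2\alpha+1)\log(p) + \log(2)} + \gamma_{s,\rho_-}(X)\ \sigma_{\max}(X_{\hat{S} \cap S^c}) \left( \|\hat{\beta}_{\hat{S} \cap S^c}\|_1 + \|\beta_{\hat{S} \cap S^c}\|_1 \right) \tag{3.30} $$

with probability greater than $1 - p^{-\alpha}$.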
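3.2.4. A basic inequality. The definition of $\hat{\beta}$ gives

$$ \frac{1}{2}\|y - X\hat{\beta}\|_2^2 + \lambda \|\hat{\beta}\|_1 \leq \frac{1}{2}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1. $$

Therefore, we have that

$$ \frac{1}{2}\|\varepsilon - X(\hat{\beta} - \beta)\|_2^2 + \lambda \|\hat{\beta}\|_1 \leq \frac{1}{2}\|\varepsilon\|_2^2 + \lambda \|\beta\|_1, $$

which implies that

$$ \begin{aligned} \frac{1}{2}\|X(\hat{\beta} - \beta)\|_2^2 \leq{} & \langle \varepsilon, X_S(\hat{\beta}_S - \beta_S) \rangle + \langle \varepsilon, X_{\hat{S} \cap S^c}(\hat{\beta}_{\hat{S} \cap S^c} - \beta_{\hat{S} \cap S^c}) \rangle \\ & + \lambda \left( \|\beta_S\|_1 - \|\hat{\beta}_S\|_1 \right) - \lambda \|\hat{\beta}_{\hat{S} \cap S^c}\|_1 + \lambda \|\beta_{\hat{S} \cap S^c}\|_1. \end{aligned} $$

This can be further written as

$$ \begin{aligned} \frac{1}{2}\|X(\hat{\beta} - \beta)\|_2^2 \leq{} & \langle \varepsilon, X_S(\hat{\beta}_S - \beta_S) \rangle + \langle X_{\hat{S} \cap S^c}^t \varepsilon, \hat{\beta}_{\hat{S} \cap S^c} - \beta_{\hat{S} \cap S^c} \rangle && (3.31) \\ & + \lambda \left( \|\beta_S\|_1 - \|\hat{\beta}_S\|_1 \right) - \lambda \|\hat{\beta}_{\hat{S} \cap S^c}\|_1 + \lambda \|\beta_{\hat{S} \cap S^c}\|_1. && (3.32) \end{aligned} $$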



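3.2.5. Control of $\langle \varepsilon, X_S(\hat{\beta}_S - \beta_S) \rangle$. The argument is divided
into two steps.
First step. We have

$$ \begin{aligned} \langle \varepsilon, X_S(\hat{\beta}_S - \beta_S) \rangle &= \langle X_S^t \varepsilon, \hat{\beta}_S - \beta_S \rangle \\ &\leq \|X_S^t \varepsilon\|_2\ \|\hat{\beta}_S - \beta_S\|_2 \\ &\leq \sqrt{s}\ \|X_S^t \varepsilon\|_\infty\ \|\hat{\beta}_S - \beta_S\|_2 \end{aligned} $$

and, using the fact that $\sigma_{\min}(X_I) \geq \rho_-$,

$$ \langle \varepsilon, X_S(\hat{\beta}_S - \beta_S) \rangle \leq \frac{s}{\rho_-\ \sigma_{\min}(X_S)}\ \|X_S^t \varepsilon\|_\infty\ \|X_I^t X_S(\hat{\beta}_S - \beta_S)\|_\infty. $$

Second step. Since the columns of $X$ have unit $\ell_2$-norm, we have

$$ \mathbb{P}\left( \|X_S^t \varepsilon\|_\infty \geq \sigma \sqrt{2\alpha \log(p) + \log(2s)} \right) \leq p^{-\alpha}, $$

which implies that

$$ \langle \varepsilon, X_S(\hat{\beta}_S - \beta_S) \rangle \leq s\ \frac{\sigma \sqrt{2\alpha \log(p) + \log(2s)}}{\rho_-\ \sigma_{\min}(X_S)}\ \|X_I^t X_S(\hat{\beta}_S - \beta_S)\|_\infty \tag{3.33} $$

with probability at least $1 - p^{-\alpha}$.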

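3.2.6. Control of $\langle X_{\hat{S} \cap S^c}^t \varepsilon, \hat{\beta}_{\hat{S} \cap S^c} - \beta_{\hat{S} \cap S^c} \rangle$. We have

$$ \langle X_{\hat{S} \cap S^c}^t \varepsilon, \hat{\beta}_{\hat{S} \cap S^c} - \beta_{\hat{S} \cap S^c} \rangle \leq \|X_{\hat{S} \cap S^c}^t \varepsilon\|_\infty\ \|\hat{\beta}_{\hat{S} \cap S^c} - \beta_{\hat{S} \cap S^c}\|_1. \tag{3.34} $$

On the other hand, we have

$$ \mathbb{P}\left( \|X_{\hat{S} \cap S^c}^t \varepsilon\|_\infty \geq \sigma \sqrt{2\alpha \log(p) + \log(2(p-s))} \right) \leq p^{-\alpha}, $$

which, combined with (3.34), implies that

$$ \langle X_{\hat{S} \cap S^c}^t \varepsilon, \hat{\beta}_{\hat{S} \cap S^c} - \beta_{\hat{S} \cap S^c} \rangle \leq \sigma \sqrt{2\alpha \log(p) + \log(2(p-s))} \left( \|\hat{\beta}_{\hat{S} \cap S^c}\|_1 + \|\beta_{\hat{S} \cap S^c}\|_1 \right) $$

with probability at least $1 - p^{-\alpha}$.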
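3.2.7. Control of $\|\beta_S\|_1 - \|\hat{\beta}_S\|_1$. The subgradient inequality gives

$$ \|\hat{\beta}_S\|_1 - \|\beta_S\|_1 \geq \langle \mathrm{sign}(\beta_S), \hat{\beta}_S - \beta_S \rangle. $$

We deduce that

$$ \begin{aligned} \|\beta_S\|_1 - \|\hat{\beta}_S\|_1 &\leq \|{-\mathrm{sign}(\beta_S)}\|_\infty\ \|\hat{\beta}_S - \beta_S\|_1 \\ &\leq \frac{\sqrt{s}}{\rho_-\ \sigma_{\min}(X_S)}\ \|X_I^t X_S(\hat{\beta}_S - \beta_S)\|_2, \end{aligned} $$

which implies

$$ \|\beta_S\|_1 - \|\hat{\beta}_S\|_1 \leq \frac{s}{\rho_-\ \sigma_{\min}(X_S)}\ \|X_I^t X_S(\hat{\beta}_S - \beta_S)\|_\infty. \tag{3.35} $$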

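3.2.8. Summing up. Combining (3.31) with (3.33), (3.35) and (3.30), the union bound gives that, with
probability $1 - 3p^{-\alpha}$,

$$ \begin{aligned} \frac{1}{2}\|X(\hat{\beta} - \beta)\|_2^2 \leq{} & \frac{s}{\rho_-\ \sigma_{\min}(X_S)} \left( \sigma\sqrt{2\alpha \log(p) + \log(2s)} + \lambda \right) \Big( \lambda + \sigma\sqrt{(2\alpha+1)\log(p) + \log(2)} \\ & \qquad + \gamma_{s,\rho_-}(X)\ \sigma_{\max}(X_{\hat{S} \cap S^c}) \left( \|\hat{\beta}_{\hat{S} \cap S^c}\|_1 + \|\beta_{\hat{S} \cap S^c}\|_1 \right) \Big) \\ & + \sigma\sqrt{2\alpha \log(p) + \log(2(p-s))} \left( \|\hat{\beta}_{\hat{S} \cap S^c}\|_1 + \|\beta_{\hat{S} \cap S^c}\|_1 \right) \\ & + \lambda \left( \|\beta_{\hat{S} \cap S^c}\|_1 - \|\hat{\beta}_{\hat{S} \cap S^c}\|_1 \right), \end{aligned} $$

which gives,

$$ \begin{aligned} \frac{1}{2}\|X(\hat{\beta} - \beta)\|_2^2 \leq{} & s\ \frac{\lambda + \sigma\sqrt{(2\alpha+1)\log(p) + \log(2)}}{\rho_-\ \sigma_{\min}(X_S)} \left( \sigma\sqrt{2\alpha \log(p) + \log(2s)} + \lambda \right) \\ & + \Big( \frac{s}{\rho_-\ \sigma_{\min}(X_S)} \left( \sigma\sqrt{2\alpha \log(p) + \log(2s)} + \lambda \right) \gamma_{s,\rho_-}(X)\ \sigma_{\max}(X_{\hat{S} \cap S^c}) \\ & \qquad + \sigma\sqrt{2\alpha \log(p) + \log(2(p-s))} - \lambda \Big) \|\hat{\beta}_{\hat{S} \cap S^c}\|_1 \\ & + \Big( \frac{s}{\rho_-\ \sigma_{\min}(X_S)} \left( \sigma\sqrt{2\alpha \log(p) + \log(2s)} + \lambda \right) \gamma_{s,\rho_-}(X)\ \sigma_{\max}(X_{\hat{S} \cap S^c}) \\ & \qquad + \sigma\sqrt{2\alpha \log(p) + \log(2(p-s))} + \lambda \Big) \|\beta_{\hat{S} \cap S^c}\|_1. \end{aligned} $$

Using the assumption that $s \leq \nu n$, we obtain

$$ \begin{aligned} \frac{1}{2}\|X(\hat{\beta} - \beta)\|_2^2 \leq{} & s\ \frac{\lambda + \sigma\sqrt{(2\alpha+1)\log(p) + \log(2)}}{\rho_-\ \sigma_{\min}(X_S)} \left( \sigma\sqrt{2\alpha \log(p) + \log(2\nu n)} + \lambda \right) \\ & + \Big( \frac{\nu n}{\rho_-\ \sigma_{\min}(X_S)} \left( \sigma\sqrt{2\alpha \log(p) + \log(2\nu n)} + \lambda \right) \gamma_{s,\rho_-}(X)\ \sigma_{\max}(X_{\hat{S} \cap S^c}) \\ & \qquad + \sigma\sqrt{(2\alpha+1)\log(p) + \log(2)} - \lambda \Big) \|\hat{\beta}_{\hat{S} \cap S^c}\|_1 \\ & + \Big( \frac{\nu n}{\rho_-\ \sigma_{\min}(X_S)} \left( \sigma\sqrt{2\alpha \log(p) + \log(2\nu n)} + \lambda \right) \gamma_{s,\rho_-}(X)\ \sigma_{\max}(X_{\hat{S} \cap S^c}) \\ & \qquad + \sigma\sqrt{2\alpha \log(p) + \log(2p)} + \lambda \Big) \|\beta_{\hat{S} \cap S^c}\|_1. \end{aligned} $$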
Since, as recalled in Section 3.2.2, the support of $\hat{\beta}$ has cardinal less than or equal to
$n$, we have

$$ \sigma_{\max}(X_{\hat{S} \cap S^c}) \leq \max_{T \subset \{1,\ldots,p\},\ |T| \leq n} \sigma_{\max}(X_T), $$

and the proof is completed.

4. A SIMPLE TRICK WHEN $\gamma_{s,\rho_-}$ IS UNKNOWN: APPENDING A RANDOM MATRIX

We have bounded the index $\gamma_{s,\rho_-}$ for the random matrix with independent columns
uniformly distributed on the unit sphere of $\mathbb{R}^n$ in Proposition 2.2. The goal of this
section is to show that this result can be used in a simple trick in order to obtain prediction
bounds similar to [6, Theorem 2.1] without conditions on the design matrix $X$.

The idea is of course to use Theorem 2.4 above. However, the values of $\sigma_{\min}(X_S)$ and
$\sigma_{\max}(X_{\hat{S} \cap S^c})$ are usually not known ahead of time and we have to provide easy
to compute bounds for these quantities. The coherence $\mu(X)$ can be used for this purpose. Indeed,
for any positive integer $t \leq p$ and any $T \subset \{1,\ldots,p\}$ with $|T| = t$, we have

n(x) = \\x l x-L\\ ltl 

= max max w (X X — I)w' 

|M|oo=l ll u, 'lli = i 

> — — max max w (X X — I)w' . 

\Jt IMl2 = 1 lKll2 = l 

IM|o=t||w'|[o=* 
Thus, we obtain that

$$ 1 - \mu(X)\sqrt{t} \leq \sigma_{\min}(X_T) \leq \sigma_{\max}(X_T) \leq 1 + \mu(X)\sqrt{t}. $$

However, the lower bound on $\sigma_{\min}(X_S)$ obtained in this manner may not be accurate enough.
More precise, polynomial time computable, bounds have been devised in the literature. The interested
reader can find a very useful semidefinite relaxation of the problem of finding the worst possible
value of $\sigma_{\min}(X_T)$ over all subsets $T$ of $\{1,\ldots,p\}$ with a given cardinal (related
to the Restricted Isometry Constant) in [10].

Assuming we have a polynomial time computable a priori bound $\sigma^*_{\min}$ on
$\sigma_{\min}(X_T)$ (resp. $\sigma^*_{\max}$ on
$\max_{T \subset \{1,\ldots,p\},\ |T| \leq n} \sigma_{\max}(X_T)$), our main result for the case of
general design matrices is the following theorem.

Theorem 4.1. Let $X$ be a matrix in $\mathbb{R}^{n \times p}$ with $\ell_2$-normalized columns and
let $X_0$ be a random matrix with independent columns uniformly distributed on the unit sphere of
$\mathbb{R}^n$. Let $X_\#$ denote the matrix corresponding to the concatenation of $X$ and $X_0$,
i.e. $X_\# = [X, X_0]$. Let $\hat{\beta}_\#$ denote the LASSO estimator with $X$ replaced with
$X_\#$ in (1.2). Let $\rho_- \in (0,1)$. Let $\nu$ be a positive real. Assume that $p_0$ is such that



ks + ^u logOo) + \/2n Kslog(p ) + Un log(p )) , 

(4.36) V2tt 1 = < Lp_- 

y/n po 

for some $L \in (0,1)$. Assume moreover that $p_0$ is sufficiently large so that the second
inequality in (2.7) is satisfied. Assume that $\beta$ has support $S$ with cardinal $s$ and that



A > a [B' Xup < ax \/2a log(p + Po ) + \og(2vn) + V '(2a + 1) log(p + p ) + log(2) 
with 

(a -17) R> vn lvn . p _{X) 

1 j X ^ P - p- a^ n - vn lvn ^{X) a^ 

Assume that s satisfies the first inequality in 112.7]) and that s < vn. Then, with probability greater 
than 1 — p~ a — Sp^ n — p^ n — 20 n 2 p~ , we have 

(4-38) h\x0#-f3)g < >C,^ 
with



n , A + a V(2a + 1) log(p + g + log(2) / , 

C n,p,p_,a,^,A = — -; ^V 2 " Iog(P + Po) + log(2im) + A 



J n,p,p-,a,u,\ 

r— u min 

Proof. Since the index $\gamma_{s,\rho_-}$ does not increase after appending a matrix with
$\ell_2$-normalized columns, the matrix $X_\#$ has at most the same index as that of $X_0$. Then
(4.36) ensures that the index $\gamma_{s,\rho_-}(X_\#)$ is sufficiently small. The rest of the proof
is identical to the proof of Theorem 2.4. □
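In practice the trick amounts to concatenating the design with normalized Gaussian columns, which are uniformly distributed on the unit sphere. The sketch below is only an illustration: the function names are hypothetical, and the number of appended columns `p0` is left to the user since the theorem prescribes it through (4.36).

```python
import numpy as np

def random_sphere_columns(n, p0, rng):
    """n x p0 matrix with i.i.d. columns uniform on the unit sphere of R^n,
    obtained by normalizing i.i.d. standard Gaussian columns."""
    G = rng.standard_normal((n, p0))
    return G / np.linalg.norm(G, axis=0, keepdims=True)

def append_random_design(X, p0, seed=0):
    """Return the concatenated design [X, X0] with X0 random as above."""
    rng = np.random.default_rng(seed)
    X0 = random_sphere_columns(X.shape[0], p0, rng)
    return np.hstack([X, X0])
```

The LASSO is then run on the enlarged matrix, and the prediction is the product of the enlarged matrix with the enlarged coefficient vector.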

Appendix A. Proof of Lemma 3.3.
For any index set $S \subset \{1,\ldots,\kappa s\}$ with cardinal $s$, define $R_S$ as the diagonal
matrix with

$$ (R_S)_{i,i} = \begin{cases} 1 & \text{if } i \in S, \\ 0 & \text{otherwise.} \end{cases} $$

Notice that we have

$$ \|X_S^t X_S - I\| = \|R_S H R_S\| $$

with $H = X^t X - I$. In what follows, $R$ simply denotes a diagonal matrix with i.i.d. diagonal
components $\delta_j$, $j = 1,\ldots,\kappa s$, with Bernoulli $\mathcal{B}(1, 1/\kappa)$
distribution. Let $R'$ be an independent copy of $R$. Assume that $S$ is drawn uniformly at random
among index sets of $\{1,\ldots,\kappa s\}$ with cardinal $s$. By an easy Poissonization argument,
similar to [6, Claim (3.29) p. 2173], we have that

$$ \mathbb{P}\left( \|R_S H R_S\| \geq r \right) \leq 2\, \mathbb{P}\left( \|R H R\| \geq r \right), \tag{A.39} $$

and by Proposition 4.1 in [7], we have that

$$ \mathbb{P}\left( \|R H R\| \geq r \right) \leq 36\, \mathbb{P}\left( \|R H R'\| \geq r/2 \right). \tag{A.40} $$
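The two selection models and the identity between the restricted Gram matrix and $R_S H R_S$ can be illustrated as follows. The helper names are hypothetical, and the sketch only shows the objects involved, not the Poissonization or decoupling arguments themselves.

```python
import numpy as np

def uniform_selector(m, s, rng):
    """Diagonal of R_S for S uniform among subsets of {0,...,m-1} of size s."""
    d = np.zeros(m)
    d[rng.choice(m, size=s, replace=False)] = 1.0
    return d

def bernoulli_selector(m, kappa, rng):
    """Diagonal of R with i.i.d. Bernoulli(1/kappa) entries."""
    return (rng.random(m) < 1.0 / kappa).astype(float)

def restricted_norm(H, d):
    """Spectral norm of R H R, where R = diag(d)."""
    return np.linalg.norm((d[:, None] * H) * d[None, :], 2)
```

With $m = \kappa s$ indices, the Bernoulli model selects $s$ indices on average, which is why it serves as a proxy for the uniform model with exactly $s$ indices.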

In order to bound the right hand side term, we will use [7] Proposition 4.2]. Set r' = r/2. Assuming 
that k— > u 2 > -IIXII 4 and v 2 > -||X|| 2 , the right hand side term can be bounded from above as 

e — — k 11 11 — k 11 11 ' ° 

follows: 

(A.41) ¥(\\RHR'\\>r') < 3 ks V(s, [r',u,v]), 

with 

lu 2 \^ / l||M|| 4 \ u2/l|M|12 / 1I|M||^^ (M)2 



V(s,[r',u,v]) = e--^ + e~^H + e- 



Kr' 2 J \ K U 2 J \ K V 2 



Using (|3.18p and (|3,20p . we deduce that with probability at least 1 — 8p Q n — p Q n , we have 

4 " 2 

,„ r , n ( lu 2 \£ ( 1 (#F^l°g(p )) ^ (^^'^ 



K V z 



2 

2\ T^Z 



+ \e- 



Take k, u and v such that 



2 /2 1 

v = r 

log(C K n) 

2 n ( l + K e n + KS 

\c(l — ep n 

3 Cv / 1 + K £ n + ks 
V 2 \c(l-e) 4 rT 



2 



K > e -^ -7i -TJ 10g(po) 



2 
for some $C_V$ possibly depending on $s$. Since $\kappa s \leq C_\kappa n$, this implies in particular that



(A , 2) „ > *% ((i±^i±^) log(ro) 



]ntr(rin\ 1 

r- \ c{l-e) 4 



Thus, we obtain that 



log(C K n) / ,2 \ °V n , n nn , 2r L P '\ 



v(-,K,u,t,]) - (^j +^j +v e2Cv 



Using (jA~39|) . (1X401) and (jA~4ll . we obtain that 
P (\\R S HR S \\ >r') < 2 x 36 x 3 x ks 



2r' 2 - 2 



/ / -, \ log(C K n) / ,2 \ C V . . _fI_ZL 

1 \ &v ' I r \ ( log(C K n) \ l °s(c K n) 



'V 



e 2 y \e 2 C 2 / 'V e 2 C v 



Take 

(A.43) C v = log(C K n) 

and, since po > 1 and r G (0, 1), we obtain 
¥(\\R s HR s \\>r') 

log(C K n) 

(A.44) < 2 x 36 x 3 x ks I ( ^ 1 + 




r' 2 N 


log(C K n) 

) + 


V e 2 y 


o /2 2 
2r p 

log(C K n) 


e 2 log 2 (C«n) . 


) 



Replace r' by r/2. Since it is assumed that n > exp(r/2)/C K and po ^ v21og(C K n)/r, it is sufficient 
to impose that 

i 
C 2 K n 2 > (2x36x3xksx3)« ? ), 

in order for the right hand side of (A.44) to be less than one. Since $\kappa s \leq C_\kappa n$, it
is sufficient to impose that

$$ C_\kappa^2 n^2 \geq 2 \times 36 \times 3 \times C_\kappa n \times 3, $$

or equivalently,

$$ C_\kappa n \geq 2 \times 36 \times 3 \times 3. $$

This is implied by (2.7) in the assumptions. On the other hand, combining (A.42) and (A.43) implies
that one can take

4e 3 ( (1 + K S )(1 + C K ) \\ 2 
K = 7T ^ c (i- g )4 J log (Po)log(C re n), 

which is nothing but (2.6) in the assumptions.